Diagnosing Diseases using kNN

An application of kNN to diagnose Diabetes

Jacqueline Razo (Advisor: Dr. Cohen)

2025-04-14

Introduction

  • k-Nearest Neighbors (kNN) is an algorithm used in a variety of fields to classify or predict data (Ali et al. 2020).

  • It is a simple algorithm that classifies a data point based on how similar it is to an existing class of data points (Zhang 2016).

  • One benefit of this model is its simplicity; it is also non-parametric, which means it fits a wide variety of datasets.

  • One drawback is its higher computational cost compared with other models, which means it does not perform as well or as fast on big data (Deng et al. 2016).

  • In this project we focused on the methodology and application of kNN classification models in healthcare to predict diabetes.

Methodology Overview

  • The kNN algorithm is a non-parametric supervised learning algorithm that can be used for classification or regression problems (Syriopoulos et al. 2023).

  • In classification, it labels a data point by finding the k nearest training points and assigning the class held by the majority of those neighbors.

  • Figure 1 illustrates this methodology with two distinct classes of hearts and circles.

Figure 1

Methodology - The classification process

The classification process has three distinct steps:

  1. Distance Calculation
  2. Neighbor Selection
  3. Classification decision based on majority voting

Methodology - Distance calculation

  1. Distance calculation: The kNN first measures the distance between the data point it is trying to classify and every training data point. Different distance measures can be used, but the default and most common choice for kNN is the Euclidean distance (Kataria and Singh 2013).

\[ d = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2} \]
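As a minimal sketch of this step, the same calculation extends to any number of features; the two-dimensional formula above is the special case with two coordinates.

```python
import numpy as np

def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

# Matches the two-dimensional formula above: sqrt((4-1)^2 + (6-2)^2) = 5.0
print(euclidean_distance([1, 2], [4, 6]))
```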

Methodology - Neighbor Selection

kNN exposes a parameter k that controls how many neighbors the algorithm uses to classify the unknown data point. Studies recommend using cross-validation or heuristic methods, such as setting k to the square root of the dataset size, to determine an optimal value (Zhang 2016); a cross-validation sketch follows the figure below.

Figure 1
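A minimal sketch of both selection strategies, assuming scikit-learn; X_train and y_train are illustrative names for preprocessed training data, not the project's actual variables.

```python
from math import sqrt
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def choose_k(X_train, y_train, k_values=range(1, 31, 2)):
    """Pick k by 5-fold cross-validated accuracy; odd values help avoid ties."""
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                 X_train, y_train, cv=5).mean()
              for k in k_values}
    return max(scores, key=scores.get)

def sqrt_heuristic_k(n_samples):
    """The square-root heuristic mentioned above: k ~ sqrt(dataset size)."""
    return int(sqrt(n_samples))
```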

Methodology - Majority voting

Once the k-nearest neighbors are identified, the algorithm assigns the new data point the most frequent class label among its neighbors. In cases of ties, distance-weighted voting can be applied, where closer neighbors have higher influence on the classification decision.

Figure 1
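A minimal sketch of both voting rules, assuming the k nearest neighbors have already been found; neighbor_labels and neighbor_distances are illustrative names, not part of the project code.

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Plain majority vote over the k nearest neighbors' class labels."""
    return Counter(neighbor_labels).most_common(1)[0][0]

def weighted_vote(neighbor_labels, neighbor_distances):
    """Distance-weighted vote: closer neighbors count more (weight 1/d)."""
    weights = Counter()
    for label, d in zip(neighbor_labels, neighbor_distances):
        weights[label] += 1.0 / (d + 1e-9)  # small epsilon guards against d == 0
    return weights.most_common(1)[0][0]

print(majority_vote([1, 0, 1, 1, 0]))             # 1
print(weighted_vote([1, 0, 0], [0.1, 2.0, 3.0]))  # 1: the closest neighbor dominates
```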

Assumptions

  • The kNN algorithm assumes that similar data points lie in close proximity to each other, i.e., that they are neighbors (Zhang 2016).

  • It also assumes that data points with similar features belong to the same class (Boateng, Otoo, and Abaye 2020).

Pre-processing Data

  • Handle missing values: Missing values must be either imputed or dropped to prevent them from skewing the results.
  • Make all values numeric: Categorical values must be encoded, using either one-hot encoding or label encoding.
  • Normalize or standardize the features: Min-max scaling or standardization keeps features with large ranges from dominating the distance calculation.
  • Reduce dimensionality: Principal Component Analysis (PCA) can reduce the number of features while retaining most of the variance.
  • Remove correlated features: kNN works best without too many features, so a correlation matrix can reveal redundant features to drop.
  • Fix class imbalance: The Synthetic Minority Over-sampling Technique (SMOTE) can correct class imbalances that would otherwise bias the model. A sketch chaining several of these steps follows this list.
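A hedged sketch of how several of these steps could be chained, assuming scikit-learn and the imbalanced-learn package; the project may instead have applied the steps individually. Placing SMOTE inside the pipeline ensures it is applied only to training folds.

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

pipeline = Pipeline([
    ("scale", StandardScaler()),        # standardize every feature
    ("pca", PCA(n_components=0.95)),    # keep components explaining 95% of variance
    ("smote", SMOTE(random_state=42)),  # oversample the minority class (training folds only)
    ("knn", KNeighborsClassifier()),
])
# pipeline.fit(X_train, y_train) once the data has been split
```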

Hyperparameter Tuning

To increase the accuracy of the model, there are a few parameters we can adjust; a grid-search sketch follows this list.

  1. Find the optimal k: We can use a grid search to find the best value of k.
  2. Change the distance metric: kNN uses the Euclidean distance by default, but the Manhattan distance, the Minkowski distance, or another metric can be substituted.
  3. Weights: kNN defaults to “uniform” weights, giving every neighbor equal weight, but it can be set to “distance” so that closer neighbors count more.
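A minimal grid-search sketch over the three parameters just listed, using scikit-learn's GridSearchCV; the grid values are illustrative, not the project's exact search space.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Illustrative grid over the three parameters listed above
param_grid = {
    "n_neighbors": [5, 11, 15, 21],
    "metric": ["euclidean", "manhattan", "minkowski"],
    "weights": ["uniform", "distance"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
# search.fit(X_train, y_train)
# search.best_params_ then holds the winning combination
```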

Data Exploration and Visualization

  • We explored the CDC Diabetes Health Indicators dataset, sourced from the UC Irvine Machine Learning Repository. The data were gathered by the Centers for Disease Control and Prevention (CDC) through the Behavioral Risk Factor Surveillance System (BRFSS), one of the largest continuous health surveys in the United States.

  • Python and the ucimlrepo package were used to import the dataset directly from the UCI Machine Learning Repository, following the repository's recommended instructions, as sketched below. This made it easy to save, prepare, and analyze the data for the current research.
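A minimal sketch of the import step; the repository id 891 is assumed here to be the id listed for the CDC Diabetes Health Indicators dataset on the UCI repository page.

```python
from ucimlrepo import fetch_ucirepo

# Fetch the dataset by its repository id (assumed here to be 891)
cdc = fetch_ucirepo(id=891)
X = cdc.data.features   # the 21 feature columns as a pandas DataFrame
y = cdc.data.targets    # the Diabetes_binary target column
```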

Data Exploration and Visualization - Variables

  • The dataset consists of 253,680 survey responses and contains 21 feature variables and one binary target variable named Diabetes_binary.

  • Diabetes_binary: 0 = no diabetes, 1 = diabetes

  • Binary Variables: HighBP, HighChol, CholCheck, Smoker, Stroke, HeartDiseaseorAttack, PhysActivity, Fruits, Veggies, HvyAlcoholConsump, AnyHealthcare, NoDocbcCost, DiffWalk, Sex.

  • Ordinal Variables: GenHlth, MentHlth, PhysHlth, Age, Education, Income

  • Continuous Variables: BMI

Data Exploration and Visualization - Mean of Features

Figure 4 shows a graph of the mean of different features in the data.

Data Exploration and Visualization - Outliers

Figure 5 shows outliers in the data that can skew our results.

Data Exploration and Visualization - Class imbalance

Figure 6 shows the class imbalance present in the data.

Data Exploration and Visualization - Correlation Analysis

A correlation heatmap was generated in Figure 7 to examine relationships between variables. The correlation heatmap helps identify strongly correlated features, which may lead to redundancy in the model.
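A hedged sketch of how such a heatmap can be produced with pandas and seaborn; the exact plotting code and correlation threshold used in the project are assumptions here.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# X is the feature DataFrame loaded earlier
corr = X.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation heatmap of the 21 features")
plt.show()

# Flag feature pairs whose absolute correlation exceeds a chosen threshold
high = [(a, b) for a in corr.columns for b in corr.columns
        if a < b and abs(corr.loc[a, b]) > 0.5]
```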

Data Exploration and Visualization - Key Findings

  • There are no missing values, meaning no imputation is needed.

  • We have some duplicate values that need to be removed.

  • There is a class imbalance with the majority of cases not having diabetes.

Modeling and Results - Data Preprocessing

  • There was no missing data, so we did not have to remove or impute any values.
  • We started cleaning the data by dropping the duplicate rows.
  • We kept the ordinal variables as-is, since their natural order gives the kNN meaningful distances.
  • We divided the data into training and testing sets with test_size=0.2, using 80% of the data to train the kNN and 20% to test it.
  • We chose to standardize the features so that BMI and Age would be on the same scale as the other features. A sketch of the split-and-scale step follows this list.
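A minimal sketch of the split-and-standardize step described above; df is assumed to combine the features and target in one DataFrame, and random_state is an illustrative choice. The scaler is fit on the training data only, so no test information leaks into the model.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# df combines the features and Diabetes_binary target; drop duplicate rows first
df = df.drop_duplicates()
X = df.drop(columns="Diabetes_binary")
y = df["Diabetes_binary"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit the scaler on training data only
X_test = scaler.transform(X_test)        # reuse the training statistics on the test set
```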

Modeling and Results - Model Creation

We chose to create three classification kNN models to illustrate the methodology.

Table 1: Model Summary

  Model     k    Weights      Distance    SMOTE
  Model 1   5    'uniform'    Euclidean   No
  Model 2   15   'uniform'    Euclidean   No
  Model 3   15   'distance'   Euclidean   Yes
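A hedged sketch of the three configurations in Table 1, assuming scikit-learn and imbalanced-learn; SMOTE is applied to the training data only, for Model 3.

```python
from sklearn.neighbors import KNeighborsClassifier
from imblearn.over_sampling import SMOTE

# Euclidean distance is scikit-learn's default metric for kNN
model1 = KNeighborsClassifier(n_neighbors=5, weights="uniform")
model2 = KNeighborsClassifier(n_neighbors=15, weights="uniform")
model3 = KNeighborsClassifier(n_neighbors=15, weights="distance")

model1.fit(X_train, y_train)
model2.fit(X_train, y_train)

# Model 3: oversample the minority (diabetic) class in the training set only
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
model3.fit(X_res, y_res)
```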

Modeling and Results - Evaluating the models

The table below summarizes the performance of the three models.

Table 2: kNN Model Performance Summary

  Model     k    Weights    SMOTE   Accuracy   F1 Score   Precision   Recall   ROC AUC
  Model 1   5    Uniform    No      83.22%     27.77%     40.66%      21.09%   0.71
  Model 2   15   Uniform    No      84.56%     22.38%     48.37%      14.56%   0.77
  Model 3   15   Distance   Yes     67.77%     39.77%     27.84%      69.58%   0.74
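A minimal sketch of how the metrics in Table 2 can be computed with scikit-learn; model stands for any of the three fitted models above.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def evaluate(model, X_test, y_test):
    """Compute the five metrics reported in Table 2 for a fitted model."""
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]  # probability of the diabetic class
    return {
        "Accuracy": accuracy_score(y_test, pred),
        "F1 Score": f1_score(y_test, pred),
        "Precision": precision_score(y_test, pred),
        "Recall": recall_score(y_test, pred),
        "ROC AUC": roc_auc_score(y_test, proba),
    }
```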

Modeling and Results

  • Model 2 has the highest accuracy at 84.56%, but this score is inflated because the model is good at detecting the non-diabetic cases, which make up the majority of the data.

  • It also has the highest ROC AUC score, 0.77, which means it is the best model at separating the two classes; however, its recall is only 14.56%.

  • This means the model correctly classifies only 14.56% of the actual positive diabetes cases.

  • Model 3 has a lower accuracy of 67.77% but a much higher recall of 69.58%: it correctly identifies about 70% of the positive diabetes cases.

Conclusion

  • kNN is a promising algorithmic model that can be further improved to detect diabetes.

  • In this project we created three kNN models trained to classify unknown data points as diabetic or non-diabetic, using the CDC Diabetes Health Indicators dataset from the UC Irvine Machine Learning Repository.

  • We saw how fine-tuning a kNN model can help detect diabetes in a healthcare setting.

  • Models 2 and 3 showed potential for classifying diabetic cases but would need to be further improved, for example by training on data containing more diabetic cases, before use in a healthcare setting.

References

Ali, Ameer, Mohammed Alrubei, LF Mohammed Hassan, M Al-Ja’afari, and Saif Abdulwahed. 2020. “Diabetes Classification Based on kNN.” IIUM Engineering Journal 21 (1): 175–81.
Boateng, Ernest Yeboah, Joseph Otoo, and Daniel A Abaye. 2020. “Basic Tenets of Classification Algorithms k-Nearest-Neighbor, Support Vector Machine, Random Forest and Neural Network: A Review.” Journal of Data Analysis and Information Processing 8 (4): 341–57.
Deng, Zhenyun, Xiaoshu Zhu, Debo Cheng, Ming Zong, and Shichao Zhang. 2016. “Efficient kNN Classification Algorithm for Big Data.” Neurocomputing 195: 143–48.
Kataria, Aman, and MD Singh. 2013. “A Review of Data Classification Using k-Nearest Neighbour Algorithm.” International Journal of Emerging Technology and Advanced Engineering 3 (6): 354–60.
Syriopoulos, Panos K, Nektarios G Kalampalikis, Sotiris B Kotsiantis, and Michael N Vrahatis. 2023. “kNN Classification: A Review.” Annals of Mathematics and Artificial Intelligence, 1–33.
Zhang, Zhongheng. 2016. “Introduction to Machine Learning: K-Nearest Neighbors.” Annals of Translational Medicine 4 (11).